i) Account : Unique account number used for transactional purposes (for cashiering and accounting personnel to reference).
ii) Guest Name : Name of a guest staying at the hotel.
iii) Arrive : Guest arrival (entry) date.
iv) Depart : Guest exit date.
v) Nights : Total number of nights (days) a guest is staying.
vi) Status : Current room status
O (Occupied), R (Reserved), I (Inspected), N (No-Show), C (Cancelled)
1) No-Show: A reservation becomes a No-Show when a customer with a guaranteed reservation neither cancels it before the hotel's cancellation deadline nor arrives to claim it.
2) Occupied: The customer booked the room and actually arrived on the check-in date.
3) Reserved: If the guest has not claimed or cancelled the booked room by the time specified in hotel policy, the reserved rooms must be held available until check-out time the following day.
4) Cancelled: The customer successfully cancelled the booked room within the period allowed by the cancellation policy.
5) Inspected: The main purpose of a room inspection is to catch any problems overlooked during cleaning before a guest finds them and they become a source of dissatisfaction and complaints. After guest check-out, the manager is responsible for inspecting the room before it is allocated to a new customer.
vii) Rate : The price the hotel charges for overnight accommodation (room rate).
viii) Rate code : Room rates and rate codes may vary according to available room features such as room size, location, view, furnishing, decor, and competitors' pricing. Examples:
1) Early-bird Rate: Rates that open only X days before arrival. Eg: open only 7 days before arrival.
2) Package Rate: Rates that include a guest room in combination with other available events or activities. (Eg: a Best of London package that includes room rental, all meals, sightseeing, airport transfers, etc.)
3) Best Available Rates (BAR): These rate codes are the lowest discounted rates available for a day, which the Reservation or Front Desk staff can offer to the guest. BAR can be of different types:
| BAR Level | Open/Close when Occupancy Between |
|---|---|
| BAR-01 | 0% to 25% |
| BAR-02 | 26% to 35% |
| BAR-03 | 36% to 50% |
| BAR-04 | 51% to 75% |
| BAR-05 | 76% to 100% |

| Date | Day | Rate |
|---|---|---|
| 19/10/2016 | WED | 200.00 |
| 20/10/2016 | THU | 200.00 |
| 21/10/2016 | FRI | 250.00 |
| 22/10/2016 | SAT | 250.00 |
| 23/10/2016 | SUN | 250.00 |
| 24/10/2016 | MON | 200.00 |
| 25/10/2016 | TUE | 200.00 |
4) Family Rate: A rate reserved for families with children. Usually, these rates include extra-bed charges and may also include some free add-on activities for children.
ix) Room : Respective room numbers assigned to rooms as per the hotel configuration.
x) Type :
1) NQQ - Double Queen Rooms (Non Smoking 2 Queen bed)
2) NK - Single King Rooms (Non Smoking Single King bed)
3) SNQQ - Double Queen Suites (Suites Non Smoking 2 Queen bed)
4) SNQQ1 - Double Queen Suites (Suites Non Smoking 2 Queen bed)
5) SNK - Single King Suites (Suites Non Smoking Single King bed)
6) NHQQ - Double Queen Room Handicap Equipped (Non Smoking Handicap 2 Queen bed)
7) NHK - Single King Room Handicap Equipped (Non Smoking Handicap Single King bed)
8) SNHK - Single King Suites Handicap Equipped (Suites Non Smoking Handicap Single King bed)
Note: i) 1 Queen bed --> Bed = 1 | Max adults = 2
ii) 2 Queen bed --> Bed = 2 | Max adults = 4
ref: https://www.comfortinnofblueridge.com/accommodations/rooms/
xi) Source : The booking channel used by the guest
1) CRS - Central Reservation System (online mode) : 15714
2) DIRECT - Direct (direct mode) : 9330
3) WI - Walk In (direct mode) : 2423
xii) CRS Conf No : N.A
xiii) GTD : N.A
Guaranteed --> Prior to a function, the figure given by a meeting planner to the property for the number of persons to be served.
| Code | Count |
|---|---|
| VI | 7565 |
| MC | 4808 |
| CC | 4704 |
| DB | 3661 |
| CA | 2985 |
| GM | 2786 |
| AX | 673 |
| DS | 255 |
| DP | 13 |
| CK | 9 |
| 4P | 6 |
xiv) Reserve Date : Room reservation date (the day on which the guest reserves/books the rooms by entering details such as arrival, departure, adults, beds, etc.)
xv) User : Reception desk person ID responsible for handling guest activities.
xvi) Shared Account : Whether a shared/same account was used for the booking.
xvii) Track Code : Purpose of the room booking
Corporate (meeting, signing, dealing)
Leisure (rest)
Walk-in (other purposes, quick decision)
xviii) Package : Complete package cost for the guest, including the per-person charge, inclusive of guest room, food and beverage, and a proportional percentage of all other event charges.
(Packages are seasonal - Winter, Spring, Summer)
xix) Cancellation Date : The date on which the cancellation request was initiated for the booked/reserved room.
xx) CXL User ID : Person ID (reception desk person/individual) responsible for cancelling the booked/reserved room.
i) Date : Date of the occupancy
ii) Day : Day associated with date
iii) Rooms : Total number of the rooms in hotel
iv) OOO : Out Of Order rooms; typically used when a room is being renovated or is undergoing repairs and cannot be used.
v) StayOver : The guest is not expected to check out today and will remain at least one more night, possibly a week.
vi) Arrivals : Displays all guests arriving on a certain day or within a certain set time period.
vii) DueOut : The room is expected to become vacant after the following day’s checkout time.
viii) Available : Total number of available rooms (a -ve value indicates overbooking)
ix) Group Block : Rooms booked as a group or package (e.g., for an event, people book multiple rooms)
x) Group Picked Up : Rooms from a group block that were actually picked up (booked).
xi) TransNGTD : NA.
*GTD: Guaranteed. Prior to a function, the figure given by a meeting planner to the property for the number of persons to be served.
xii) TransGTD : NA.
*GTD: Guaranteed. Prior to a function, the figure given by a meeting planner to the property for the number of persons to be served.
xiii) Occupied : Total number of rooms in use/occupied (a guest is currently registered to the room). [Rooms - Available - GroupBlock + GroupPickedUp]
xiv) OccPercentage : Room occupancy rate on a particular day (occupied rooms / total available rooms). The occupancy rate tells you exactly how full the hotel is at any point in time.
xv) RoomRev : Total revenue generated from occupied rooms on a particular day.
xvi) RevPAR : Revenue Per Available Room (total revenue / total available rooms). This metric provides a glimpse into how many rooms are being sold and how much revenue those bookings generate. Use RevPAR to understand how best to maximise the revenue generated per room. If a property's RevPAR is increasing, its average room rate or occupancy rate (or both) must be increasing.
xvii) ADR : Average daily rate on a particular day (total revenue / rooms occupied). It is calculated from the revenue earned and the number of rooms sold, giving an average rate. [Ref: https://www.siteminder.com/r/calculate-revpar/]
1) ADR tells you how much revenue each sold room brings in on average, while RevPAR tells you how much revenue you bring in across all your rooms.
2) ADR simply indicates the price of your rooms, while RevPAR tells you how much money you yield from each room, sold or not.
xviii) Ppl : NA.
**Formula Details - Traditionally, the effectiveness of a revenue management strategy is measured using the following KPIs
a) Occupancy rate — the number of occupied rental units at a given time, compared to the total number of available rental units at that time.
Occupancy rate = Rooms sold / Rooms available
b) Average daily rate (ADR) — the average rental income per sold room during a certain period. It is compared against the hotel's historical ADR or competitors' metrics.
ADR = Rooms revenue earned / Number of rooms sold
c) Revenue per available room (RevPAR) — a KPI that assesses the financial and business performance of a hotel. RevPAR measures a property's ability to fill all its rooms and set the best price for them. There are two ways to calculate it:
RevPAR = Rooms revenue / Rooms available
RevPAR = Average daily rate * Occupancy rate
d) Total revenue per available room (TRevPAR) — a metric that assesses the total revenue generated by the property, based on room cost and the money spent on it. This KPI captures a snapshot of overall business performance. TRevPAR is one of the main benchmarking tools for big hotels and resorts.
The higher the TRevPAR, the better the revenue.
TRevPAR = Total revenue / Total number of available rooms
e) Net revenue per available room (NRevPAR) — a KPI that lets hotel revenue managers account for distribution costs when assessing how room revenue is generated. NRevPAR subtracts spending on marketing and distribution.
NRevPAR = (Room revenue - distribution costs) / Number of available rooms
f) Gross operating profit per available room (GOPPAR) — measures the profit of a hotel and the value of all assets at any given time. GOPPAR measures profit relative to capacity, including all of a hotel's spending and taxes.
GOPPAR = Gross Operating Profit / Number of available rooms
Ref: https://www.siteminder.com/r/calculate-revpar/ | https://www.upstay.tech/hotel-revenue-management-formulas-kpis-calculations-use-cases/
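The KPI formulas above translate directly into code. The sketch below restates them as plain Python functions; the variable names are illustrative, and the example numbers are made up.

```python
# Sketch of the revenue-management KPI formulas from this section.
def occupancy_rate(rooms_sold, rooms_available):
    return rooms_sold / rooms_available

def adr(room_revenue, rooms_sold):
    return room_revenue / rooms_sold

def revpar(room_revenue, rooms_available):
    return room_revenue / rooms_available

def trevpar(total_revenue, rooms_available):
    return total_revenue / rooms_available

def nrevpar(room_revenue, distribution_costs, rooms_available):
    return (room_revenue - distribution_costs) / rooms_available

def goppar(gross_operating_profit, rooms_available):
    return gross_operating_profit / rooms_available

# Example: 80 of 100 rooms sold for $8,000 of room revenue.
occ = occupancy_rate(80, 100)    # 0.8
rate = adr(8000, 80)             # 100.0
# The two RevPAR formulas agree: revenue/available == ADR * occupancy
assert abs(revpar(8000, 100) - rate * occ) < 1e-9
print(occ, rate, revpar(8000, 100))  # 0.8 100.0 80.0
```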
'''
Library and packages
'''
Importing the '.txt' data from the source and saving it in '.csv' format
'''
Reservation data and Occupancy data
'''
We have raw data that needs to be analyzed and prepared for the business case studies. This stage involves data preprocessing, data cleaning, and data understanding.
i) Column rename
ii) Conversion of datatypes
iii) Replacement of unnecessary data values and EDA
iv) Missing value analysis
v) Date param extractions
vi) Data transformations and experiments
vii) Data insights
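Steps i)–v) above can be sketched in pandas on a toy frame. The column names below mirror the data dictionary but are otherwise illustrative; the real notebook operates on the full reservation file.

```python
import pandas as pd

# Toy reservation rows; column names mirror the data dictionary above.
df = pd.DataFrame({
    "Arrive": ["19/10/2016", "20/10/2016", None],
    "Nights": ["1", "2", "5"],
    "Rate":   ["200.00", "N.A", "250.00"],
    "Type":   ["NQQ", "NK", "NQQ"],
})

# i) column rename
df = df.rename(columns={"Arrive": "ArrivalDate"})
# ii) datatype conversion
df["ArrivalDate"] = pd.to_datetime(df["ArrivalDate"], format="%d/%m/%Y")
df["Nights"] = df["Nights"].astype(int)
# iii) replace placeholder values such as "N.A" (coerced to NaN)
df["Rate"] = pd.to_numeric(df["Rate"], errors="coerce")
# iv) missing value analysis
missing = df.isna().sum()
# v) date parameter extraction
df["Month"] = df["ArrivalDate"].dt.month
print(missing["ArrivalDate"], missing["Rate"])  # 1 1
```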
'''
Size of reservation data
'''
Dimension of Reservation data : (40604, 7)
Observation:
'''
Number of unique data categories?
'''
RateCode     103
Type           8
Source         3
TrackCode      4
dtype: int64
Observations:
In given property data,
'''
Type of rooms
'''
Observations:
'''
Type of rate codes
'''
Observations:
'''
Type of track code
'''
Observations:
'''
Total nights or stays analysis from data
'''
| Night (stays) | 1 | 2 | 5 | 3 | 4 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Total counts | 25790.00 | 6923.00 | 2551.00 | 2339.00 | 2036.00 | 310.00 | 187.00 | 82.0 |
| % counts | 63.52 | 17.05 | 6.28 | 5.76 | 5.01 | 0.76 | 0.46 | 0.2 |
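A distribution table like the one above can be derived with `value_counts`. The sketch below uses a small illustrative series rather than the real Nights column.

```python
import pandas as pd

# Sketch: counts and percentage share per number of nights.
nights = pd.Series([1, 1, 1, 2, 2, 5, 3, 1, 4, 1])  # toy data
counts = nights.value_counts()
pct = (counts / counts.sum() * 100).round(2)
table = pd.DataFrame({"Total counts": counts, "% counts": pct})
print(table.loc[1, "% counts"])  # 50.0
```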
Observations:
i) Column rename
ii) Conversion of datatypes
iii) Replacement of unnecessary data values
iv) Missing value analysis
v) Date param extractions
vi) Data transformations and experiments
vii) Data insights and EDA
'''
Occupancy data size
'''
Dimension of Occupancy data : (2922, 18)
Observation:
Observation:
We have 4 datatypes; for machine learning, all inputs must strictly be in numeric format.
Column Name Datatype format
-------------------------------
Date datetime
Year int
Month int
Days int
Nights int
Type object
TrackCode object
RateCode object
Rate float
Occupied int
ADR float
'''
The table below represents the yield data
'''
| OccupancyLevel | 0Day | 1Day | 2Day | 7Day | 15Day | 30Day | 60Day | 90Day | MaxDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0-10 | -0.20 | -0.17 | -0.15 | -0.12 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 |
| 1 | 11-20 | -0.17 | -0.15 | -0.12 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 |
| 2 | 21-30 | -0.15 | -0.12 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 |
| 3 | 31-40 | -0.12 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 |
| 4 | 41-50 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 |
| 5 | 51-60 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 |
| 6 | 61-70 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 |
| 7 | 71-80 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 | 0.20 |
| 8 | 81-90 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 | 0.20 | 0.20 |
| 9 | 91-100 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 | 0.20 | 0.20 | 0.20 |
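One way to use the yield table above is as a lookup that adjusts a base rate by the factor for the current occupancy band and booking lead time. This is a sketch under an assumption: that the factors are multiplicative rate adjustments. Only a few rows/columns of the table are reproduced.

```python
# Sketch: yield adjustment lookup from the table above.
# Assumption: each factor scales the base rate as rate * (1 + factor).
YIELD = {
    # (low, high) occupancy band -> {days-before-arrival bucket: factor}
    (0, 10):   {0: -0.20, 1: -0.17, 7: -0.12, 30: -0.07, 90: -0.02},
    (51, 60):  {0: -0.07, 1: -0.05, 7: 0.02, 30: 0.07, 90: 0.12},
    (91, 100): {0: 0.05, 1: 0.07, 7: 0.12, 30: 0.17, 90: 0.20},
}

def yield_adjusted_rate(base_rate, occupancy_pct, days_ahead):
    for (low, high), factors in YIELD.items():
        if low <= occupancy_pct <= high:
            # use the largest lead-time bucket not exceeding days_ahead
            bucket = max(b for b in factors if b <= days_ahead)
            return round(base_rate * (1 + factors[bucket]), 2)
    raise ValueError("occupancy band not covered in this sketch")

print(yield_adjusted_rate(100.0, 95, 30))  # -> 117.0
```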
Here, we add the holidays for the US region using the Python package below.
Package name : holidays
Description : US federal holiday calendar
ref: https://pypi.org/project/holidays/
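A holiday flag can be attached to the Date column as follows. The notebook uses the `holidays` package; to keep this sketch self-contained, the 2022 US federal holiday dates from this section's output are hard-coded instead.

```python
import pandas as pd

# 2022 US federal holidays (as listed in this section's output).
US_HOLIDAYS_2022 = pd.to_datetime([
    "2022-01-01", "2022-01-17", "2022-02-21", "2022-05-30",
    "2022-06-19", "2022-06-20", "2022-07-04", "2022-09-05",
    "2022-10-10", "2022-11-11", "2022-11-24", "2022-12-25",
    "2022-12-26",
])

# Flag each date that falls on a holiday (1) or not (0).
df = pd.DataFrame({"Date": pd.to_datetime(["2022-07-04", "2022-07-05"])})
df["Holiday"] = df["Date"].isin(US_HOLIDAYS_2022).astype(int)
print(df["Holiday"].tolist())  # [1, 0]
```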
Note: Initially we check (1) how well the model learns/captures this holiday effect, (2) whether prices change with this effect, and (3) since the holiday data is very small compared to the total data length, whether any effect is detectable at all with this initial data.
'''
Holiday dates and descriptions
'''
2022-01-01 - New Year's Day
2022-01-17 - Martin Luther King Jr. Day
2022-02-21 - Washington's Birthday
2022-05-30 - Memorial Day
2022-06-19 - Juneteenth National Independence Day
2022-06-20 - Juneteenth National Independence Day (Observed)
2022-07-04 - Independence Day
2022-09-05 - Labor Day
2022-10-10 - Columbus Day
2022-11-11 - Veterans Day
2022-11-24 - Thanksgiving
2022-12-25 - Christmas Day
2022-12-26 - Christmas Day (Observed)
Observation:
Here, we extract day of week, weekend flag, month name, day name, week number, etc., from the Date column.
i) Extraction of data parameters
ii) Pre-processing data
iii) Validate data quality
iv) Analyzing the data and EDA
Note: A raw Date (datatype: datetime64[ns]) cannot be handled directly by an ML model. Machine learning algorithms expect numeric inputs, so we extract the required fields from the Date column.
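The extraction described above can be sketched with pandas' `.dt` accessor; the column names here are illustrative.

```python
import pandas as pd

# Sketch: derive numeric calendar features from the Date column so the
# ML model never sees raw datetimes.
df = pd.DataFrame({"Date": pd.to_datetime(["2016-10-22", "2016-10-24"])})
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["DayOfWeek"] = df["Date"].dt.dayofweek          # Monday = 0
df["IsWeekend"] = (df["DayOfWeek"] >= 5).astype(int)
print(df[["DayOfWeek", "IsWeekend"]].values.tolist())  # [[5, 1], [0, 0]]
```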
Event data is currently taken from a web portal; events in the date range 2017 to 2027 are saved in an Excel file.
ref: https://www.timeanddate.com/holidays/us/super-bowl
Ex. (1) Per the reference link, below are the major sporting events in 2022.
13 Feb Sunday Super Bowl Sporting event
18 Apr Monday Boston Marathon Sporting event
6 May Friday Kentucky Oaks Sporting event
7 May Saturday Kentucky Derby Sporting event
21 May Saturday Preakness Stakes Sporting event
11 Jun Saturday Belmont Stakes Sporting event
6 Nov Sunday New York City Marathon Sporting event
Note: Initially we check (1) how well the model learns/captures this event effect, (2) whether prices change with this effect, and (3) since the event data is very small compared to the total data length, whether any effect is detectable at all with this initial data.
Machine learning algorithms expect all inputs in numeric format, so in this step we convert non-numeric (object) columns to numeric (int) form using encoding techniques.
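The notebook does not say which encoder it uses, so as one simple illustrative option, `pd.factorize` assigns a stable integer code per category:

```python
import pandas as pd

# Sketch: turn object (string) columns into integer codes.
df = pd.DataFrame({
    "Type": ["NQQ", "NK", "NQQ", "SNK"],
    "TrackCode": ["LEISURE", "CORPORATE", "LEISURE", "WALKIN"],
})
for col in df.select_dtypes(include="object"):
    codes, uniques = pd.factorize(df[col])   # codes in first-seen order
    df[col] = codes
print(df["Type"].tolist())  # [0, 1, 0, 2]
```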
In this section, we apply machine learning algorithms to our data. The business case is a prediction problem, so we apply ML regression algorithms and examine the results on the given data.
Input features : Year, Month, Days, Nights, Type, RateCode, TrackCode, Occupied
Output feature : Rate
Models : Linear model, Decision model, Boosting model
ML algorithms : LinearRegression, SVR, RandomForestRegression, GradientBoostingRegressor, XGBRegressor, KNeighborsRegressor
Metrics : MAE, RMSE, R2
Strategy - On a 100% scale, about 70% of the data is used for model training and the remaining 30% for testing. The model is thus built on 70% of the data, and we test how the model performs/behaves on the 30% of data it has not seen. This section is the base model.
'''
Data splitting for Machine Learning models
'''
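The 70/30 split can be sketched with a shuffled index permutation (the notebook presumably uses scikit-learn's `train_test_split`; this numpy version shows the same idea self-contained):

```python
import numpy as np

# Sketch of a shuffled 70/30 train/test split.
def split_70_30(X, y, seed=42):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # shuffle row indices
    cut = int(len(X) * 0.7)
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

X = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100, dtype=float)
X_tr, X_te, y_tr, y_te = split_70_30(X, y)
print(len(X_tr), len(X_te))  # 70 30
```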
Models :
Cross Validation :
Accuracy measures :
Error measure :
Keypoints - For best model performance, the points below should be considered for model selection.
ref: https://scikit-learn.org/stable/model_selection.html#model-selection
'''
Machine Learning Model 1
'''
ML Algorithm     : LinearRegression()
R2 Score (train) : 0.35311734277965434
R2 Score (test)  : 0.3454420452832617
CV Scores        : [0.32823183 0.37244873 0.34333893 0.35166847 0.34916579]
CV Scores mean   : 0.3489707491837339
MAE rate         : 16.721313679138696
MSE rate         : 483.75551118371715
RMSE rate        : 21.994442734102567
Wall time: 334 ms
| Actual | Prediction | Diff | |
|---|---|---|---|
| 5834 | 104.50 | 103.774546 | 0.73 |
| 35404 | 110.00 | 96.340908 | 13.66 |
| 12992 | 144.32 | 110.330746 | 33.99 |
| 36147 | 99.00 | 89.018656 | 9.98 |
| 30203 | 125.10 | 106.115283 | 18.98 |
| ... | ... | ... | ... |
| 28410 | 90.00 | 86.215416 | 3.78 |
| 14431 | 70.00 | 76.870664 | -6.87 |
| 17721 | 99.00 | 98.217732 | 0.78 |
| 1324 | 65.00 | 73.770245 | -8.77 |
| 1559 | 94.05 | 95.715983 | -1.67 |
11925 rows × 3 columns
'''
Machine Learning Model 2
'''
ML Algorithm     : DecisionTreeRegressor()
R2 Score (train) : 0.9696395461357734
R2 Score (test)  : 0.774123806817491
CV Scores        : [0.78076799 0.78132714 0.77070523 0.77360479 0.76227651]
CV Scores mean   : 0.7737363320700531
MAE rate         : 6.421869970986376
MSE rate         : 166.93533782585078
RMSE rate        : 12.920345886463364
Wall time: 2.61 s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 5834 | 104.50 | 97.950000 | 6.55 |
| 35404 | 110.00 | 110.000000 | 0.00 |
| 12992 | 144.32 | 125.000000 | 19.32 |
| 36147 | 99.00 | 89.000000 | 10.00 |
| 30203 | 125.10 | 122.550000 | 2.55 |
| ... | ... | ... | ... |
| 28410 | 90.00 | 90.000000 | 0.00 |
| 14431 | 70.00 | 109.857143 | -39.86 |
| 17721 | 99.00 | 99.000000 | 0.00 |
| 1324 | 65.00 | 65.000000 | 0.00 |
| 1559 | 94.05 | 125.550000 | -31.50 |
11925 rows × 3 columns
'''
Machine Learning Model 3
'''
ML Algorithm     : RandomForestRegressor()
R2 Score (train) : 0.955801467567185
R2 Score (test)  : 0.84451992913283
CV Scores        : [0.84680105 0.85130809 0.84663149 0.8446779 0.84471015]
CV Scores mean   : 0.846825734258647
MAE rate         : 5.706260409024578
MSE rate         : 114.90860453109552
RMSE rate        : 10.719543112049857
Wall time: 1min 20s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 5834 | 104.50 | 95.535950 | 8.96 |
| 35404 | 110.00 | 110.931429 | -0.93 |
| 12992 | 144.32 | 131.873800 | 12.45 |
| 36147 | 99.00 | 88.898238 | 10.10 |
| 30203 | 125.10 | 124.260000 | 0.84 |
| ... | ... | ... | ... |
| 28410 | 90.00 | 90.000000 | 0.00 |
| 14431 | 70.00 | 76.117600 | -6.12 |
| 17721 | 99.00 | 99.000000 | 0.00 |
| 1324 | 65.00 | 65.000000 | 0.00 |
| 1559 | 94.05 | 101.717297 | -7.67 |
11925 rows × 3 columns
'''
Machine Learning Model 4
'''
R2 Score (train) : 0.8938754536203146
R2 Score (test)  : 0.8560536031167904
CV Scores        : [0.85679053 0.85825166 0.85907572 0.85857106 0.86630919]
CV Scores mean   : 0.8597996296962591
MAE rate         : 6.579201321672508
MSE rate         : 100.45588154579252
RMSE rate        : 10.022768157839057
Wall time: 24.9 s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 68111 | 101.412908 | 105.944046 | -4.53 |
| 4819 | 112.075894 | 104.399254 | 7.68 |
| 57305 | 89.100000 | 100.965729 | -11.87 |
| 36592 | 76.037644 | 77.574059 | -1.54 |
| 70129 | 100.000000 | 125.763565 | -25.76 |
| ... | ... | ... | ... |
| 430 | 99.000000 | 132.476593 | -33.48 |
| 18238 | 110.000000 | 120.430710 | -10.43 |
| 10251 | 118.703046 | 113.550629 | 5.15 |
| 56733 | 114.000000 | 108.585419 | 5.41 |
| 53303 | 80.797862 | 85.859436 | -5.06 |
18153 rows × 3 columns
'''
Machine Learning Model 5
'''
ML Algorithm     : KNeighborsRegressor()
R2 Score (train) : 0.8390347527774555
R2 Score (test)  : 0.7474403731818305
CV Scores        : [0.76321282 0.76464483 0.7512027 0.76124977 0.75407274]
CV Scores mean   : 0.7588765720187757
MAE rate         : 7.888033542976939
MSE rate         : 186.6559110547589
RMSE rate        : 13.662207400517637
Wall time: 7.27 s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 5834 | 104.50 | 96.142 | 8.36 |
| 35404 | 110.00 | 109.000 | 1.00 |
| 12992 | 144.32 | 90.308 | 54.01 |
| 36147 | 99.00 | 89.800 | 9.20 |
| 30203 | 125.10 | 119.700 | 5.40 |
| ... | ... | ... | ... |
| 28410 | 90.00 | 90.000 | 0.00 |
| 14431 | 70.00 | 70.000 | 0.00 |
| 17721 | 99.00 | 99.000 | 0.00 |
| 1324 | 65.00 | 65.000 | 0.00 |
| 1559 | 94.05 | 104.236 | -10.19 |
11925 rows × 3 columns
'''
Machine Learning Model 6
'''
ML Algorithm     : GradientBoostingRegressor()
R2 Score (train) : 0.7627236935074548
R2 Score (test)  : 0.7590169985129133
CV Scores        : [0.7646562 0.75746387 0.75574216 0.76197451 0.75006547]
CV Scores mean   : 0.7579804411893672
MAE rate         : 9.648806972266968
MSE rate         : 178.10012731633682
RMSE rate        : 13.345415966403475
Wall time: 29.3 s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 5834 | 104.50 | 102.145549 | 2.35 |
| 35404 | 110.00 | 112.841109 | -2.84 |
| 12992 | 144.32 | 116.144812 | 28.18 |
| 36147 | 99.00 | 97.072799 | 1.93 |
| 30203 | 125.10 | 120.867691 | 4.23 |
| ... | ... | ... | ... |
| 28410 | 90.00 | 94.731002 | -4.73 |
| 14431 | 70.00 | 98.006643 | -28.01 |
| 17721 | 99.00 | 96.071330 | 2.93 |
| 1324 | 65.00 | 64.372914 | 0.63 |
| 1559 | 94.05 | 99.527762 | -5.48 |
11925 rows × 3 columns
We compare the model performances so that we can choose the best model for tuning and predictions.
Table columns:
ref: (Cross validation) https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
'''
Machine Learning Model performance analysis
'''
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
| ML Algorithms | Training Score | CV Mean Score | Testing score | MAE Rate | MSE Rate | RMSE Rate | |
|---|---|---|---|---|---|---|---|
| 0 | XGBRegressor(base_score=0.5, booster='gbtree',... | 0.889591 | 0.845926 | 0.846648 | 6.769352 | 113.335624 | 10.645921 |
| 1 | (DecisionTreeRegressor(max_features='auto', ra... | 0.955801 | 0.846826 | 0.844520 | 5.706260 | 114.908605 | 10.719543 |
| 2 | DecisionTreeRegressor() | 0.969640 | 0.773736 | 0.774124 | 6.421870 | 166.935338 | 12.920346 |
| 3 | ([DecisionTreeRegressor(criterion='friedman_ms... | 0.762724 | 0.757980 | 0.759017 | 9.648807 | 178.100127 | 13.345416 |
| 4 | KNeighborsRegressor() | 0.839035 | 0.758877 | 0.747440 | 7.888034 | 186.655911 | 13.662207 |
| 5 | LinearRegression() | 0.353117 | 0.348971 | 0.345442 | 16.721314 | 483.755511 | 21.994443 |
Observations:
Note: The top 2 models are XGBRegressor (83%) and RandomForestRegressor (83%). Previously we got 80% accuracy; here we improved accuracy by about 3%. To find the best suitable parameters, we need to perform hyperparameter tuning, which can further enhance the model's results.
'''
Selection of model
'''
Experiments on instance/data balancing, followed by modeling.
ref: https://imbalanced-learn.org/stable/references/over_sampling.html
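The notebook uses imbalanced-learn's SMOTE/ADASYN/SMOTEENN resamplers. To illustrate the core idea self-contained, the sketch below implements a minimal SMOTE-like oversampler in numpy: synthetic minority samples are created by interpolating between a minority sample and one of its k nearest minority neighbours. This is a conceptual sketch, not the library's implementation.

```python
import numpy as np
from collections import Counter

def smote_like_oversample(X, y, target_class, n_new, k=5, seed=None):
    """Minimal SMOTE-like sketch: synthesize n_new samples of
    target_class by interpolating toward a random one of the k
    nearest same-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = X[y == target_class]
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                        # interpolation factor
        new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    X_new = np.vstack([X, np.array(new_points)])
    y_new = np.concatenate([y, np.full(n_new, target_class)])
    return X_new, y_new

# Toy imbalanced data: class 0 majority, class 1 minority.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = smote_like_oversample(X, y, target_class=1, n_new=80, seed=0)
print(Counter(y_bal))  # -> Counter({0: 90, 1: 90})
```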
'''
Class distribution in data (before applying ML technique)
'''
Class=5, n=2787 (7.012%)
Class=3, n=16160 (40.657%)
Class=0, n=842 (2.118%)
Class=2, n=11455 (28.820%)
Class=7, n=2975 (7.485%)
Class=6, n=3770 (9.485%)
Class=4, n=404 (1.016%)
Class=1, n=1354 (3.407%)
Observation:
'''
Class distribution in data (after applying ML technique)
'''
Class=5, n=16160 (12.500%)
Class=3, n=16160 (12.500%)
Class=0, n=16160 (12.500%)
Class=2, n=16160 (12.500%)
Class=7, n=16160 (12.500%)
Class=6, n=16160 (12.500%)
Class=4, n=16160 (12.500%)
Class=1, n=16160 (12.500%)
Observation:
'''
Machine Learning Model 1
'''
Training score : 0.8745748050840594
Testing score  : 0.8535179553573975
CV score       : 0.8567685849825789
MAE score      : 6.632022970710143
RMSE score     : 10.255269604586836
Wall time: 48.1 s
'''
Machine Learning Model 2
'''
Training score : 0.9793532202028591
Testing score  : 0.9132275765652069
CV score       : 0.9187644925110348
MAE score      : 3.5940022690171674
RMSE score     : 7.893066266251923
Wall time: 6min 16s
'''
Machine Learning Model 3
'''
Training score : 0.9205924751638159
Testing score  : 0.8695406802808101
CV score       : 0.8787533460366079
MAE score      : 4.522488247984103
RMSE score     : 9.678152036944146
Wall time: 28.4 s
'''
Machine Learning Model 4
'''
Training score : 0.7541828431166233
Testing score  : 0.7485207184275926
CV score       : 0.7507311202505519
MAE score      : 9.657718644779559
RMSE score     : 13.4371136682871
Wall time: 1min 19s
'''
Class distribution in data (before applying ML technique)
'''
Class=5, n=2787 (7.012%)
Class=3, n=16160 (40.657%)
Class=0, n=842 (2.118%)
Class=2, n=11455 (28.820%)
Class=7, n=2975 (7.485%)
Class=6, n=3770 (9.485%)
Class=4, n=404 (1.016%)
Class=1, n=1354 (3.407%)
Observation:
'''
Class distribution in data (after applying ML technique)
'''
Class=5, n=16105 (12.437%)
Class=3, n=16160 (12.480%)
Class=0, n=16150 (12.472%)
Class=2, n=14845 (11.464%)
Class=7, n=16587 (12.809%)
Class=6, n=17320 (13.375%)
Class=4, n=16249 (12.548%)
Class=1, n=16076 (12.415%)
Observation:
'''
Machine Learning Model 1
'''
Training score : 0.8731162214879863
Testing score  : 0.8489802923910031
CV score       : 0.8500394846448881
MAE score      : 6.73801066185894
RMSE score     : 10.33483522424665
Wall time: 47.5 s
'''
Machine Learning Model 2
'''
Training score : 0.9792815507464897
Testing score  : 0.9103913993659695
CV score       : 0.9148825285849524
MAE score      : 3.7761393342396534
RMSE score     : 7.96088942721416
Wall time: 5min 14s
'''
Machine Learning Model 3
'''
Training score : 0.9155600141742807
Testing score  : 0.8644597850073013
CV score       : 0.8718770580740648
MAE score      : 4.7226115720474935
RMSE score     : 9.790859744763223
Wall time: 27.9 s
'''
Machine Learning Model 4
'''
Training score : 0.7452501601913991
Testing score  : 0.7414846585541073
CV score       : 0.7416884782824307
MAE score      : 9.684744928204143
RMSE score     : 13.521656678162675
Wall time: 1min 23s
'''
Class distribution in data (before applying ML technique)
'''
Class=5, n=2787 (7.012%)
Class=3, n=16160 (40.657%)
Class=0, n=842 (2.118%)
Class=2, n=11455 (28.820%)
Class=7, n=2975 (7.485%)
Class=6, n=3770 (9.485%)
Class=4, n=404 (1.016%)
Class=1, n=1354 (3.407%)
Observation:
'''
Class distribution in data (after applying ML technique)
'''
Class=0, n=12530 (17.819%)
Class=1, n=11097 (15.781%)
Class=2, n=5438 (7.734%)
Class=3, n=4883 (6.944%)
Class=4, n=13755 (19.561%)
Class=5, n=8074 (11.482%)
Class=6, n=6741 (9.587%)
Class=7, n=7799 (11.091%)
Observation:
'''
Machine Learning Model 1
'''
Training score : 0.8941397629744505
Testing score  : 0.8660863263330365
CV score       : 0.8697151157669074
MAE score      : 6.356461429731838
RMSE score     : 9.664947005443487
Wall time: 28.4 s
'''
Machine Learning Model 2
'''
Training score : 0.9897971557861857
Testing score  : 0.9419249084930137
CV score       : 0.9466292200032888
MAE score      : 2.7520375622818656
RMSE score     : 6.364759036266756
Wall time: 2min 31s
'''
Machine Learning Model 3
'''
Training score : 0.7387654528118385
Testing score  : 0.734542822239673
CV score       : 0.73430856544963
MAE score      : 9.923953411873862
RMSE score     : 13.60768512455801
Wall time: 1min 18s
'''
Machine Learning Model performance analysis
'''
RFR
| model | resample | trainingScore | testingScore | cvScore | maeRate | rmseRate | |
|---|---|---|---|---|---|---|---|
| 0 | RFR | SMOTEENN | 0.989797 | 0.941925 | 0.946629 | 2.752038 | 6.364759 |
| 1 | RFR | SMOTE | 0.979353 | 0.913228 | 0.918764 | 3.594002 | 7.893066 |
| 2 | RFR | ADASYN | 0.979282 | 0.910391 | 0.914883 | 3.776139 | 7.960889 |
| 3 | KNN | SMOTEENN | 0.942827 | 0.904470 | 0.912219 | 3.685423 | 8.163129 |
| 4 | KNN | SMOTE | 0.920592 | 0.869541 | 0.878753 | 4.522488 | 9.678152 |
| 5 | XGB | SMOTEENN | 0.894140 | 0.866086 | 0.869715 | 6.356461 | 9.664947 |
| 6 | KNN | ADASYN | 0.915560 | 0.864460 | 0.871877 | 4.722612 | 9.790860 |
| 7 | XGB | SMOTE | 0.874575 | 0.853518 | 0.856769 | 6.632023 | 10.255270 |
| 8 | XGB | ADASYN | 0.873116 | 0.848980 | 0.850039 | 6.738011 | 10.334835 |
| 9 | GBR | SMOTE | 0.754183 | 0.748521 | 0.750731 | 9.657719 | 13.437114 |
| 10 | GBR | ADASYN | 0.745250 | 0.741485 | 0.741688 | 9.684745 | 13.521657 |
| 11 | GBR | SMOTEENN | 0.738765 | 0.734543 | 0.734309 | 9.923953 | 13.607685 |
Observation:
We compare the model performances so that we can choose the best model for tuning and predictions.
After the data balancing mechanisms, we see improvements in model performance compared to the unbalanced case. However, these techniques may duplicate or remove data points, since the different balancing algorithms use their own internal procedures (e.g., targeting large values, distance-based similarities, etc.).
'''
Machine Learning Model performance analysis
'''
RFR
| model | resample | trainingScore | testingScore | cvScore | maeRate | rmseRate | |
|---|---|---|---|---|---|---|---|
| 0 | RFR | SMOTEENN | 0.988785 | 0.930989 | 0.933843 | 3.192555 | 6.908432 |
| 1 | RFR | SMOTE | 0.979931 | 0.898479 | 0.900772 | 4.194470 | 8.474127 |
| 2 | RFR | ADASYN | 0.979573 | 0.894553 | 0.899120 | 4.260626 | 8.569204 |
| 3 | KNN | SMOTEENN | 0.934814 | 0.894007 | 0.898520 | 3.954146 | 8.561704 |
| 4 | XGB | SMOTEENN | 0.894779 | 0.858841 | 0.864455 | 6.508719 | 9.880412 |
| 5 | KNN | ADASYN | 0.904166 | 0.847872 | 0.854149 | 5.191092 | 10.292665 |
| 6 | KNN | SMOTE | 0.907778 | 0.846231 | 0.856235 | 5.153720 | 10.429196 |
| 7 | XGB | SMOTE | 0.871480 | 0.843137 | 0.844500 | 6.861812 | 10.533588 |
| 8 | XGB | ADASYN | 0.865587 | 0.836351 | 0.838462 | 7.015679 | 10.675298 |
| 9 | GBR | SMOTE | 0.746092 | 0.743788 | 0.744723 | 9.577961 | 13.462192 |
| 10 | GBR | SMOTEENN | 0.740549 | 0.737515 | 0.735846 | 9.691569 | 13.473273 |
| 11 | GBR | ADASYN | 0.739438 | 0.733639 | 0.738456 | 9.730570 | 13.619417 |
Observation:
We compare the model performances so that we can choose the best model for tuning and predictions.
After the data balancing mechanisms and removal of duplicate data points/records, accuracy drops by 1 to 3 points and error increases by 1 to 3 units compared to the previous data balancing experiment.
Results without eliminating duplicate data points:
| model | resample | trainingScore | testingScore | cvScore | maeRate | rmseRate | |
|---|---|---|---|---|---|---|---|
| 9 | RFR | SMOTEENN | 0.989901 | 0.938760 | 0.946618 | 2.886929 | 6.538762 |
| 1 | RFR | SMOTE | 0.979595 | 0.914869 | 0.918256 | 3.582198 | 7.786674 |
| 5 | RFR | ADASYN | 0.979324 | 0.911031 | 0.913912 | 3.740118 | 7.994166 |
| 10 | KNN | SMOTEENN | 0.944063 | 0.898050 | 0.912317 | 3.794717 | 8.436666 |
| 2 | KNN | SMOTE | 0.919330 | 0.869398 | 0.878157 | 4.525902 | 9.644561 |
| 6 | KNN | ADASYN | 0.916140 | 0.866234 | 0.872831 | 4.687299 | 9.802282 |
| 8 | XGB | SMOTEENN | 0.897690 | 0.864710 | 0.869792 | 6.503276 | 9.718727 |
| 0 | XGB | SMOTE | 0.877415 | 0.853478 | 0.857020 | 6.585655 | 10.215493 |
| 4 | XGB | ADASYN | 0.869149 | 0.848388 | 0.848799 | 6.767652 | 10.435700 |
| 3 | GBR | SMOTE | 0.751074 | 0.750621 | 0.747344 | 9.574245 | 13.327178 |
| 7 | GBR | ADASYN | 0.745340 | 0.744134 | 0.744509 | 9.762927 | 13.556919 |
| 11 | GBR | SMOTEENN | 0.743594 | 0.736216 | 0.739159 | 10.040380 | 13.570646 |
'''
Model selected for demonstration usage only
'''
model            RFR
resample         SMOTEENN
trainingScore    0.988785
testingScore     0.930989
cvScore          0.933843
maeRate          3.192555
rmseRate         6.908432
Name: 0, dtype: object
Observation:
Note: Hyperparameter tuning and model testing are pending...
Let's analyze how the model performs on combinations of event and holiday flags in past data (the test dataset), so we can determine whether those combinations have an effect.
Note: Each time, we need to manually pass a new data index and new values present in the test data (because we previously used the shuffle method).
Holiday consideration (previous obs)
If there is an event alone, price = 73.95 ..... (impacted by the event only, +2)
[71.8224] ........ dataframe -> holiday (1) + event (0)
[71.8224] ........ holiday (1) + event (0)
[73.9591] ........ holiday (0) + event (1)
[73.9961] ........ holiday (0) + event (0)
[71.7774] ........ holiday (1) + event (1)
Event consideration (previous obs)
If there is a holiday alone, price = 71.61 ..... (not much impacted by the holiday alone, +1)
[71.8224] ........ dataframe -> holiday (1) + event (0)
[71.8224] ........ holiday (1) + event (0)
[73.9591] ........ holiday (0) + event (1)
[73.9961] ........ holiday (0) + event (0)
[71.7774] ........ holiday (1) + event (1)
Business days consideration (previous obs)
If there is an event and a holiday, price = 92.58 ...... (impacted by event and holiday, +9)
[71.8224] ........ dataframe -> holiday (1) + event (0)
[71.8224] ........ holiday (1) + event (0)
[73.9591] ........ holiday (0) + event (1)
[73.9961] ........ holiday (0) + event (0)
[71.7774] ........ holiday (1) + event (1)
Holiday and event consideration (previous obs)
If no holiday and no event, price = 73.13 .....(impacted with both +1, but effect is smaller than alone event)
[71.8224] ........dataframe -> holiday (1) + event (0) [71.8224] ........holiday (1) + event (0) [73.9591] ........holiday (0) + event (1) [73.9961] ........holiday (0) + event (0) [71.7774] ........holiday (1) + event (1)
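The what-if comparison above can be sketched as follows: take one test row, toggle the Holiday/Event flags, and re-predict. The toy training frame and column names below are illustrative assumptions, not the notebook's data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in for the notebook's training features and target.
train = pd.DataFrame({
    "OccupancyLevel": [30, 50, 70, 20, 60, 80, 40, 55],
    "Holiday":        [0, 0, 1, 0, 1, 1, 0, 0],
    "Event":          [0, 1, 0, 0, 0, 1, 1, 0],
    "Price":          [70, 74, 72, 69, 73, 92, 75, 71],
})
model = RandomForestRegressor(random_state=0).fit(
    train[["OccupancyLevel", "Holiday", "Event"]], train["Price"])

# One test row; re-predict with each holiday/event combination.
row = pd.DataFrame({"OccupancyLevel": [50], "Holiday": [0], "Event": [0]})
for holiday, event in [(1, 0), (0, 1), (0, 0), (1, 1)]:
    price = model.predict(row.assign(Holiday=holiday, Event=event))[0]
    print(f"holiday ({holiday}) + event ({event}) -> {price:.4f}")
```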
Observation: The results above were gathered over multiple simulation runs on the historical test data.
This section identifies the best parameters for the given data and algorithm, so that we can improve on the results of the previous base model. In the base model, the parameters were left at their default values and the data was fitted using those defaults. In hyperparameter tuning, a search algorithm finds the most suitable parameters instead of the defaults; using these best parameters, we can improve the model's results.
Hyperparameter tuning is an essential part of controlling the behavior of a machine learning model. If we don’t correctly tune our hyperparameters, our estimated model parameters produce suboptimal results, as they don’t minimize the loss function. This means our model makes more errors.
Methods of tuning:
1) Grid search
2) Random search
3) Bayesian optimization
ref: https://www.anyscale.com/blog/what-is-hyperparameter-tuning
Note: Commenting out this whole section, as it consumes all available CPU cores (all physical cores), most available memory (above 98% RAM), and a lot of time (multiple hours, depending on the parameters defined, all their combinations, and the total size of the data passed).
'''
Takes a lot of time, depending on the resources available
'''
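For reference, a resource-bounded grid search could look like the sketch below. The parameter grid, data, and settings are assumptions for illustration, not the commented-out section's actual configuration; `n_jobs=1` and a small grid keep CPU and memory use modest.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real notebook would pass its feature matrix.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)

# Illustrative grid: 2 x 2 = 4 combinations, 3 folds each = 12 fits.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
}

# n_jobs=1 keeps the search on a single core to avoid the resource
# exhaustion noted above; cv=3 limits the number of fits.
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=3, scoring="r2", n_jobs=1)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```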
With parameter tuning / without parameter tuning
'''
Not applicable, as it depends on Section F)
'''
This section provides a simulation based on user inputs. The user submits the inputs and, based on them, the model predicts the room rate.
'''
User inputs dated 3rd May 2022
'''
User input accepted [Arrival date]
User input accepted [depart date]
User input accepted [Room type]
User input accepted [Rate code]
User input accepted [Track Code]
------------------------------------------------------------
Current date (today) : 2022-05-03 00:00:00
Check-in date : 2022-05-06 00:00:00
Check-out date : 2022-05-07 00:00:00
------------------------------------------------------------
Expected check-in days : 3
Number of nights (stay) : 1
Occupancy rate (today) : 32
Occupancy rate (arrival) : 75
ADR rate (today) : 74.72
ADR rate (arrival) : 138.11
------------------------------------------------------------
Room type : NQQ
Rate code : BAR
Track code : LEISURE
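The derived fields above (expected check-in days, number of nights) follow directly from the submitted dates; a minimal sketch, assuming pandas timestamps:

```python
import pandas as pd

# Dates from the user-input example above.
today = pd.Timestamp("2022-05-03")
check_in = pd.Timestamp("2022-05-06")
check_out = pd.Timestamp("2022-05-07")

expected_checkin_days = (check_in - today).days   # days until arrival
nights = (check_out - check_in).days              # length of stay
print(expected_checkin_days, nights)
```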
'''
Data of current date where user is booking the room (current day data)
'''
current year : 2022
current month : 5
current day : 3
current day week : 1
current day name : Tuesday
current month name : May
current weekend status : 0
current holiday status : 0
current holiday name : NA
current event status : 0
current event name : NA
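The calendar fields above can be derived from the current date with pandas; a minimal sketch (holiday and event lookups, which come from separate tables, are not shown):

```python
import pandas as pd

current = pd.Timestamp("2022-05-03")
features = {
    "current year": current.year,
    "current month": current.month,
    "current day": current.day,
    "current day week": current.dayofweek,        # Monday = 0, so Tuesday = 1
    "current day name": current.day_name(),
    "current month name": current.month_name(),
    "current weekend status": int(current.dayofweek >= 5),  # Sat/Sun = 1
}
print(features)
```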
'''
Below are the predictions for the next days
'''
| | Date | Month | Day | Nights | RoomType | RateCode | TrackCode | Holiday | Event | OccupancyLevel | PredictedPrice | YieldPrice | ADR | YieldRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-03 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 32 | 94.69 | 83.33 | 74.72 | -0.12 |
| 1 | 2022-05-04 | 5 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 34 | 94.81 | 83.44 | 80.08 | -0.12 |
| 2 | 2022-05-05 | 5 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 48 | 95.77 | 86.19 | 87.62 | -0.10 |
| 3 | 2022-05-06 | 5 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 1 | 75 | 95.26 | 101.93 | 138.11 | 0.07 |
| 4 | 2022-05-07 | 5 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 1 | 72 | 95.32 | 102.00 | 133.76 | 0.07 |
| 5 | 2022-05-08 | 5 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 11 | 95.04 | 85.53 | 74.62 | -0.10 |
| 6 | 2022-05-09 | 5 | Monday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 9 | 95.20 | 83.78 | 74.92 | -0.12 |
| 7 | 2022-05-10 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 3 | 98.86 | 87.00 | 74.53 | -0.12 |
| 8 | 2022-05-11 | 5 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 2 | 98.99 | 87.11 | 112.88 | -0.12 |
| 9 | 2022-05-12 | 5 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 98.71 | 86.87 | 107.18 | -0.12 |
| 10 | 2022-05-13 | 5 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 11 | 94.85 | 85.36 | 160.30 | -0.10 |
| 11 | 2022-05-14 | 5 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 11 | 94.87 | 85.39 | 159.11 | -0.10 |
| 12 | 2022-05-15 | 5 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 99.99 | 87.99 | 105.71 | -0.12 |
| 13 | 2022-05-16 | 5 | Monday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 6 | 100.86 | 88.75 | 102.51 | -0.12 |
| 14 | 2022-05-17 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 7 | 101.17 | 89.03 | 104.74 | -0.12 |
| 15 | 2022-05-18 | 5 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 7 | 101.69 | 89.49 | 114.15 | -0.12 |
| 16 | 2022-05-19 | 5 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 13 | 96.91 | 87.22 | 102.97 | -0.10 |
| 17 | 2022-05-20 | 5 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 9 | 96.80 | 85.18 | 131.24 | -0.12 |
| 18 | 2022-05-21 | 5 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 1 | 6 | 101.72 | 89.51 | 150.29 | -0.12 |
| 19 | 2022-05-22 | 5 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 102.71 | 90.39 | 96.99 | -0.12 |
| 20 | 2022-05-23 | 5 | Monday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 103.01 | 90.65 | 74.53 | -0.12 |
| 21 | 2022-05-24 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 103.17 | 90.79 | 78.99 | -0.12 |
| 22 | 2022-05-25 | 5 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 103.22 | 90.84 | 85.28 | -0.12 |
| 23 | 2022-05-26 | 5 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 2 | 102.95 | 90.60 | 64.90 | -0.12 |
| 24 | 2022-05-27 | 5 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 102.95 | 90.60 | 116.73 | -0.12 |
| 25 | 2022-05-28 | 5 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 6 | 102.82 | 90.48 | 135.59 | -0.12 |
| 26 | 2022-05-29 | 5 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 102.82 | 90.48 | 104.31 | -0.12 |
| 27 | 2022-05-30 | 5 | Monday | 1 | NQQ | BAR | LEISURE | 1 | 0 | 3 | 103.10 | 90.72 | 79.98 | -0.12 |
| 28 | 2022-05-31 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 2 | 103.10 | 90.72 | 103.33 | -0.12 |
| 29 | 2022-06-01 | 6 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 2 | 99.00 | 87.12 | 82.28 | -0.12 |
| 30 | 2022-06-02 | 6 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 30 | 94.54 | 87.92 | 112.19 | -0.07 |
| 31 | 2022-06-03 | 6 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 39 | 94.31 | 89.59 | 150.65 | -0.05 |
| 32 | 2022-06-04 | 6 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 54 | 95.57 | 97.48 | 153.80 | 0.02 |
| 33 | 2022-06-05 | 6 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 10 | 95.08 | 83.67 | 108.64 | -0.12 |
| 34 | 2022-06-06 | 6 | Monday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 98.84 | 86.98 | 112.46 | -0.12 |
| 35 | 2022-06-07 | 6 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 99.04 | 87.15 | 115.44 | -0.12 |
| 36 | 2022-06-08 | 6 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 1 | 99.15 | 87.25 | 88.70 | -0.12 |
| 37 | 2022-06-09 | 6 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 98.70 | 86.85 | 119.43 | -0.12 |
| 38 | 2022-06-10 | 6 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 98.71 | 86.87 | 112.78 | -0.12 |
| 39 | 2022-06-11 | 6 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 1 | 2 | 98.69 | 86.84 | 147.88 | -0.12 |
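Across the rows above, YieldRatio appears to equal (YieldPrice - PredictedPrice) / PredictedPrice; this is inferred from the printed values, not confirmed from the notebook's code. A quick check on the first row:

```python
# Apparent relationship in the prediction table (inferred, not from the
# notebook's source): YieldRatio = (YieldPrice - PredictedPrice) / PredictedPrice.
predicted_price = 94.69   # row 0, PredictedPrice
yield_price = 83.33       # row 0, YieldPrice
yield_ratio = (yield_price - predicted_price) / predicted_price
print(round(yield_ratio, 2))   # matches the -0.12 shown in row 0
```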
In this section, we can analyze the prediction visualizations over the date factor.
'''
Line chart for next days prediction with detail information
'''
Observation:
'''
Line chart for occupancy level with next days predictions
'''
Observation:
'''
Bar chart for monthly averaged prediction price and yield price
'''
| Months | PredictedPrice | YieldPrice |
|---|---|---|
| May | 99.357241 | 89.012759 |
| Jun | 98.413667 | 87.976333 |
| Jul | 99.849677 | 88.296774 |
| Aug | 97.320000 | 86.276667 |
Observation:
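The monthly averages above are consistent with a pandas groupby-mean over the prediction table; a minimal sketch on a tiny stand-in frame (the real notebook would group the full table):

```python
import pandas as pd

# Tiny illustrative subset of the prediction table.
pred = pd.DataFrame({
    "Months": ["May", "May", "Jun", "Jun"],
    "PredictedPrice": [94.69, 94.81, 99.00, 94.54],
    "YieldPrice": [83.33, 83.44, 87.12, 87.92],
})

# sort=False keeps months in calendar order of appearance.
monthly = pred.groupby("Months", sort=False)[["PredictedPrice", "YieldPrice"]].mean()
print(monthly)
```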
'''
Bar chart for daily averaged prediction price and yield price
'''
| Day | PredictedPrice | YieldPrice |
|---|---|---|
| Monday | 99.406923 | 87.772308 |
| Tuesday | 99.442857 | 87.784286 |
| Wednesday | 98.889286 | 87.572857 |
| Thursday | 98.720000 | 87.831538 |
| Friday | 98.563846 | 89.075385 |
| Saturday | 99.098462 | 90.292308 |
| Sunday | 99.935385 | 88.235385 |
Observation:
'''
Bar chart for Event averaged prediction price and yield price
'''
| Event | PredictedPrice | YieldPrice |
|---|---|---|
| 0 | 99.214382 | 88.049551 |
| 1 | 97.747500 | 95.070000 |
Observation:
'''
Bar chart for holiday averaged prediction price and yield price
'''
| Holiday | PredictedPrice | YieldPrice |
|---|---|---|
| 0 | 99.106404 | 88.339663 |
| 1 | 100.150000 | 88.615000 |
Observation: